### Problem 1)

| Pipeline  | Total single-cycle datapath latency (Total useful work) | Latch latency | СРІ |
|-----------|---------------------------------------------------------|---------------|-----|
| 5 stages  | 2500 ps                                                 | 50 ps         | 1.2 |
| 14 stages | 2500 ps                                                 | 50 ps         | 1.5 |

i) Give the maximum frequency for the 5 stage and 14 stage pipeline implementations.

5 stage: Useful work per stage = 2500ps / 5 = 500ps Minimum clock period = 500ps + 50ps = 550ps. Maximum frequency = 1.8 GHz

14 stage: Useful work per stage = 2500ps / 14 = 179 ps Minimum clock period = 179 ps + 50ps = 229 ps. Maximum frequency = 4.36 GHz

ii) Give the speedup of the 14 stage pipeline implementation over the 5 stage pipeline implementation.

Speedup<sub>14 stage/5 stage</sub> = Performance<sub>14 stage</sub> / Performance<sub>5 stage</sub> = (4.36 GHZ / (1.5 \* # instructions)) / 1.8 GHz / (1.2 \* # instructions) = 2.906 / 1.5 = 1.94x

# Problem 2) Consider the following pipeline:

| Fetch Delay Decode Delay2 Execute Delay3 Memory/ Delay4 Write Branch Back |
|---------------------------------------------------------------------------|
|---------------------------------------------------------------------------|

i) Assuming branch outcomes are determined in the Memory stage, and mispredicted branches are corrected in that stage, how many instructions must be flushed upon a mispredicted branch?

6

ii) Assume no other hazards, what is the CPI for this pipeline, assuming 20% of instructions are branches and assuming 90% of branches are correctly predicted?

80% of instructions taken 1 cycle, 20% of instructions and branches and of these branches, 90% take 1 cycle while 10% take 1 cycle + 6 cycles of penalty

```
.8 + .2 * (.9 * 1 + .1 * (1 + 6)) = 1.12 cycles per instruction
```

iii) Assume load results are returned to the pipeline in the Memory stage. How many stall cycles will occur when a load is immediately followed by an instruction that consumes the result of the load.

#### Example:

lw R1, 0(R2) add R3, R1, R4

2 stall cycles

### Problem 3)

Use the following pseudo-assembly code for the following problem:

```
LW R1, 0(R8); load X into R1
Branch to FOO if (taken if R1 is even multiple of 2); b0
ADDI R3, R3, 1
FOO: Branch to SNA if (taken if R1 is even multiple of 4); b1
ADDI R4, R4, 1
SNA: ADDI R8, R8, 4; move to next X
```

Assume the above sequence is in a loop and completely ignore the loop back branch (not shown). Assume R0 initially holds the value 0 and the contents of memory are shown:

| Memory value: | 6 | 8 | 7 | 12 |
|---------------|---|---|---|----|
| Address:      | 0 | 4 | 8 | 12 |

Consider the GAg predictor with a 2-bit global history register. Assume the actual outcome of each new branch is shifted in from the left. Assume each entry in the pattern history table is a two-bit, saturating up-down counter.





# i) Complete the following:

| Branch |                              | 6  | 8  | 7  | 12 |
|--------|------------------------------|----|----|----|----|
| b0     | History (before branch)      | 01 | 10 | 11 | 00 |
|        | PHT value<br>(before branch) | 11 | 00 | 00 | 01 |
|        | Prediction                   | Т  | NT | NT | NT |
|        | Actual<br>Outcome            | Т  | Т  | NT | Т  |
|        | PHT value<br>(after branch)  | 11 | 01 | 00 | 01 |
|        | History (after branch)       | 11 | 01 | 10 | 01 |
| b1     | History (before branch)      | 11 | 01 | 10 | 01 |
|        | PHT value<br>(before branch) | 01 | 11 | 01 | 11 |
|        | Prediction                   | NT | Т  | NT | Т  |
|        | Actual<br>Outcome            | NT | Т  | NT | Т  |
|        | PHT value<br>(after branch)  | 00 | 11 | 00 | 11 |
|        | History (after branch)       | 10 | 11 | 00 | 11 |

ii) Explain how *aliasing* in a branch predictor can cause mispredictions.

Problem 4). Assume the following branch is predicted by a perceptron predictor such as the one discussed in class and in the paper on Canvas.

BGT R1, 3, TARG; branch to TARG if R1 > 3



Perceptron Model (from Jimenez and Lin)

For the following values of R1, show in a table, the values of the weights, the output Y, the prediction (Taken vs. Not taken) and the actual output. Assume four bits of history are used. Start with a history of 0101. (Assume x4 is the input for the most recent branch and new, actual branch outcomes are shifted in on the right (i.e. history bits are shirted to the left).

| X | w0 (bias) | w1 | w2 | w3 | w4 | у  | Prediction | Outcome |
|---|-----------|----|----|----|----|----|------------|---------|
| 4 | 1         | 2  | 3  | 4  | -1 | -3 | NT         | Т       |
| 7 | 2         | 1  | 4  | 3  | 0  | 2  | Т          | Т       |
| 3 | 3         | 2  | 3  | 4  | 1  | 9  | Т          | NT      |
| 1 | 2         | 3  | 2  | 3  | 0  | 10 | Т          | NT      |
| 4 | 1         | 2  | 1  | 2  | 1  | 1  | Т          | Т       |
| 7 | 2         | 3  | 2  | 1  | 0  | 2  | Т          | Т       |
| 5 | 3         | 4  | 1  | 0  | 1  | -1 | NT         | Т       |

R1=4

History initially: NT T NT T

Weights: 1 2 3 4 -1

Y = 1 + -1 \* 2 + 1 \* 3 + -1 \* 4 + 1 \* -1 = 1 + -2 + 3 + -4 + -1 = -3

R1=7

History: T NT T T

Weights: 2 1 4 3 0

Y = 2 + 1 \* 1 + -1 \* 4 + 1 \* 3 + 1 \* 0 = 2 + 1 + -4 + 3 + 0

R1 = 3

History: NT T T

Weights: 3 2 3 4 1

Y = 3 + -2 + 3 + 4 + 1 = 9

R1 = 1

History: T T NT

Weights: 2 3 2 3 0

Y = 2 + 3 + 2 + 3 + -1 \* 0 = 10

R1 = 4

History: T T NT NT

Weights: 1 2 1 2 1

Y = 1 + 2 + 1 + -2 + -1 = 1

R1 = 7

History: T NT NT T

Weights: 2 3 2 1 0

Y = 2 + 3 + -2 + -1 + 0 = 2

R1 = 5

History: NT NT T T

Weights: 3 4 1 0 1

Y = 3 + -4 + -1 + 0 + 1 = -1

### Problem 5)

i) For the instructions below, show in which cycle each instruction would issue on a traditional, in-order 5-stage pipeline. Show which cycles are stalls in which no instructions are issued. Assume the first instruction is issued in cycle 1.

```
lw r3, 4(r2)
add r4, r3, r7
lw r5, 0(r2)
sub r7, r5, r10
lw r4, 0(r2)
lw r11, 24(r4)
and r7, r5, r10
beq r11, r12, Loop
```

(Starting numbering of cycles at 0):

```
0 lw r3, 4(r2)
1 stall
2 add r4, r3, r7
3 lw r5, 0(r2)
4 stall
5 sub r7, r5, r10
6 lw r4, 0(r2)
7 stall
8 lw r11, 24(r4)
9 and r7, r5, r10
10 beq r11, r12, Loop
```

ii) Reorder the instructions above to minimize the number of stall cycles on our example 5-stage RISC pipeline.

```
lw r3, 4(r2)
lw r5, 0(r2)
add r4, r3, r7
lw r4, 0(r2)
sub r7, r5, r10
lw r11, 24(r4)
and r7, r5, r10
beg r11, r12, Loop
```

Problem 6) Consider the state of the following out-of-order processor below. The Reservation Stations, Register Alias Table and Reorder Buffer are shown. Notice that architectural register names have already been reordered to physical register names. Assume the instructions are numbered in program order (i.e. 101 comes before 102, etc.). Consider also the latencies given in the table for each instruction. The latencies in the table as such that if an instruction with a latency of 1 issues in cycle i, a consumer can issue in cycle i+1.

|            | Waiting Src1 | Waiting Src2 | Dest |
|------------|--------------|--------------|------|
| 101 add    | -            | -            | PR9  |
| 102 mul    | PR9          | -            | PR3  |
| 103 branch | PR9          | PR3          | -    |
| 105 add    | -            | -            | PR4  |
| 106 store  | -            | PR4          | -    |

## **Reservation Stations**

| R1 | PR3 |
|----|-----|
| R2 | PR9 |
| R3 | AR3 |
| R4 | PR5 |

## **RAT**

|  | 106 Store | 105 Branch | 104 Load | 103 Branch | 102 mul | 101 add |
|--|-----------|------------|----------|------------|---------|---------|
|--|-----------|------------|----------|------------|---------|---------|

### ROB

| opcode | latency                           |
|--------|-----------------------------------|
| add    | 2                                 |
| mul    | 3                                 |
| branch | -<br>(no register<br>destination) |
| Store  | -<br>(no register<br>destination) |

For the instructions in the Reservation Stations in the digram on the previous page, show in which cycle each instruction would issue assuming the latencies above.

(Again starting the numbering of cycles at 0):

- 0 101 add
- 1 105 add
- 2 102 mul
- 3 106 store
- 4
- 5 103 branch